Ahmad Ossama Ahmad 18P6575

Maryam Mohamed Abdelrahman 18P8171

Support Vector Machine

After searching through the features, we removed the following:

  1. sym 5: all records share the same value (3) except for three records, which are outliers, so the feature has no effect on the result.
  2. sym 6: all records share the same value (1) except for one record, which is an outlier, so the feature has no effect on the result.
  3. vis_whan: garbage in, garbage out. Visiting a place does not mean you were exposed there; you may still have quarantined and taken all the necessary precautions against COVID-19, so this feature is unreliable.
  4. sym 2, 3, 4: the outcome is not affected by these features.
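The near-constant columns above can be detected programmatically. A minimal sketch with pandas, using a toy frame and illustrative column names (sym_5, sym_6, age are assumptions, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset; column names are illustrative.
df = pd.DataFrame({
    "sym_5": [3] * 97 + [1, 2, 4],   # near-constant: 3 except three outliers
    "sym_6": [1] * 99 + [2],         # near-constant: 1 except one outlier
    "age":   list(range(100)),       # informative feature, kept
})

def near_constant(col, threshold=0.95):
    # True when the single most frequent value covers almost every row,
    # in which case the column carries next to no information.
    return col.value_counts(normalize=True).iloc[0] >= threshold

to_drop = [c for c in df.columns if near_constant(df[c])]
df = df.drop(columns=to_drop)
print(to_drop)  # ['sym_5', 'sym_6']
```

The 0.95 threshold is a judgment call; tighten or loosen it depending on how much variance a feature needs to be worth keeping.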

The kernel is chosen via the kernel parameter and can be any of the following:

• linear: ⟨x, x'⟩.

• polynomial: (γ⟨x, x'⟩ + r)^d, where d is specified by parameter degree and r by coef0.

• rbf: exp(-γ||x - x'||^2), where γ is specified by parameter gamma and must be greater than 0.

• sigmoid: tanh(γ⟨x, x'⟩ + r), where γ is specified by gamma and r by coef0.

As seen in the computation above, the optimal hyperparameters are {'C': 1000, 'class_weight': None, 'coef0': 1, 'gamma': 0.01, 'kernel': 'sigmoid'}
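Hyperparameters like these come out of a grid search. A minimal sketch with scikit-learn, using toy data and a reduced grid as stand-ins for the real search:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data standing in for the preprocessed training set (an assumption).
X, y = make_classification(n_samples=200, random_state=0)

# A reduced grid around the values reported above; the real search
# presumably covered more candidates.
param_grid = {
    "C": [1, 100, 1000],
    "kernel": ["rbf", "sigmoid"],
    "gamma": [0.01, 0.1],
    "coef0": [0, 1],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(SVC(), param_grid, cv=3).fit(X, y)
print(search.best_params_)   # best combination found by cross-validation
```

On the toy data the winning combination will differ from the one reported above; the point is the mechanics, not the numbers.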

Why is one of each pair of correlated attributes removed?

Correlated features have an almost identical influence on the dependent variable,

so keeping both will not enhance our model and will most likely degrade it;

thus we should keep only one of them.

Fewer features also lead to faster learning and a more straightforward model.

How do we remove unimportant features?

P-values assist us in identifying the features that are most important in explaining trends in the dependent variable (y).

A feature with a low p-value has more relevance in explaining y,

whereas a feature with a high p-value has less.

Typically, a significance level (threshold) is specified, and any feature whose p-value exceeds this level is removed. The method used here is backward elimination.

Backward elimination is implemented in remove_less_significant_features:

Step 1: Choose a significance level (SL) for a feature to stay in the model (0.05).

Step 2: Fit the model including all potential independent variables.

Step 3: Select the predictor with the greatest p-value.

If that p-value is greater than the SL, go to step 4.

Otherwise stop: our model is complete.

Step 4: Remove that predictor.

Step 5: Refit the model with the remaining variables and go back to step 3.
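The steps above can be sketched as follows. This is a minimal illustration, not the project's remove_less_significant_features itself: the toy data, the four-feature layout, and the hand-rolled OLS p-values are all assumptions made for a self-contained example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy data: y depends only on the first two of four features.
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

def ols_pvalues(X, y):
    # OLS with an intercept; returns each coefficient's p-value (intercept excluded).
    A = np.column_stack([np.ones(len(X)), X])
    beta, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return 2 * stats.t.sf(np.abs(beta / se), dof)[1:]

def backward_elimination(X, y, sl=0.05):
    cols = list(range(X.shape[1]))
    while cols:
        p = ols_pvalues(X[:, cols], y)
        worst = int(p.argmax())          # predictor with the greatest p-value
        if p[worst] > sl:
            cols.pop(worst)              # remove it and refit on the rest
        else:
            break                        # every remaining p-value <= SL: done
    return cols

kept = backward_elimination(X, y)
print(kept)  # the informative columns 0 and 1 survive
```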

Linear SVM


Our objective is to find a hyperplane that separates the +ve and -ve examples with the largest margin while keeping misclassifications as low as possible.

SVM code theory

xi.w + b <= -1 if yi = -1 (belongs to -ve class)

xi.w + b >= +1 if yi = +1 (belongs to +ve class)

        or

__yi(xi.w+b) >= 1__


For all support vectors (SVs), the data points which decide the margin:

xi.w+b = -1, where xi is a -ve SV and yi is -1

xi.w+b = +1, where xi is a +ve SV and yi is +1

For the decision boundary, xi.w+b = 0, where xi is a point on the decision boundary

Our Objective is to maximize Width W

W = ((X+ - X-).w)/|w|

Substituting the support-vector equations gives (X+ - X-).w = 2, so W = 2/|w|; maximizing the width is therefore the same as minimizing |w|

Once we have found the optimized w and b using the algorithm:

x.w+b = 1 is line passing through +ve support vectors

x.w+b = -1 is line passing through -ve support vectors

x.w+b = 0 is decision boundary
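Once w and b are known, classifying a new point is just a sign check against the decision boundary. A minimal sketch (the weights and bias below are illustrative assumptions, not trained values):

```python
import numpy as np

# Illustrative weights and bias, not the values learned from the real data.
w = np.array([0.4, -0.3])
b = -0.1

def predict(x):
    # Points with x.w + b >= 0 fall on the +1 side of the decision
    # boundary x.w + b = 0; all other points fall on the -1 side.
    return 1 if np.dot(x, w) + b >= 0 else -1

print(predict(np.array([2.0, 1.0])))   # 0.8 - 0.3 - 0.1 = 0.4  -> +1
print(predict(np.array([-1.0, 2.0])))  # -0.4 - 0.6 - 0.1 = -1.1 -> -1
```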

Because the search below is approximate, the lines x.w+b = ±1 it produces need not pass exactly through the support vectors.

It is a convex optimization problem and will always lead to the global minimum.

This is a linear SVM, meaning the kernel is linear.

1. Start with a random value of w, say (w0, w0); we will decrease it later.

2. Select the step size as w0*0.1. Start b small and increase it later; b ranges over (-b0 < b < +b0, step = step*b_multiple).

3. For every point xi in the dataset, and for every sign transformation of w, i.e. (w0, w0), (-w0, w0), (w0, -w0), (-w0, -w0): if yi(xi.w+b) >= 1 fails for any point, discard this candidate; otherwise compute |w| and store it in a dictionary as the key, with (w, b) as the value.

4. If w <= 0, the current step size is exhausted: go to step 5. Otherwise decrease w to (w0-step, w0-step) and go back to step 3.

5. Set step = step*0.1 and go back to step 3; stop once the step reaches w0*0.001, because finer steps cost more computation than they are worth.

6. Select the (w, b) with the minimum |w| from the dictionary.
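The search above can be sketched as a brute-force loop. This is a deliberately small illustration: the four-point toy set, w0 = 10, and the use of only two coarse-to-fine passes (instead of going all the way down to w0*0.001) are simplifying assumptions to keep it fast.

```python
import numpy as np

# Tiny linearly separable toy set standing in for the real data.
X = np.array([[1.0, 7.0], [2.0, 8.0], [8.0, 1.0], [9.0, 2.0]])
y = np.array([-1, -1, 1, 1])

w0 = 10.0
transforms = [(1, 1), (-1, 1), (1, -1), (-1, -1)]  # sign flips of w
feasible = {}                                      # maps |w| -> (w, b)

# Two coarse-to-fine passes; the full procedure refines further.
for step in (w0 * 0.1, w0 * 0.01):
    mag = w0
    while mag > 0:                                 # decrease w each round
        for b in np.arange(-w0, w0 + step, step):
            for tx, ty in transforms:
                w = np.array([mag * tx, mag * ty])
                # record (w, b) only if yi(xi.w + b) >= 1 for every point
                if np.all(y * (X @ w + b) >= 1):
                    feasible[np.linalg.norm(w)] = (w, b)
        mag -= step

w_best, b_best = feasible[min(feasible)]           # minimum |w| wins
print(w_best, b_best)
```

Compared with real SVM solvers this is hopelessly slow, but it makes the constraint check and the "minimize |w| subject to yi(xi.w+b) >= 1" objective concrete.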

We will minimize the cost/objective function shown below:

J(w, b) = (1/2)||w||^2 + C * Σ max(0, 1 - yi(xi.w + b))

In the training phase, a larger C results in a narrower margin (for infinitely large C the SVM becomes hard-margin), and a smaller C results in a wider margin.

The first half of the equation minimizes ||w||², which maximizes the margin (2/||w||);

the second half minimizes the sum of the hinge losses, which minimizes misclassifications.

We minimize the objective function using Stochastic Gradient Descent (SGD), i.e. using only one example per iteration.

Steps:

1. Find the gradient of the cost function, i.e. ∇J(w').

2. Move opposite to the gradient by a certain rate, i.e. w' = w' - α(∇J(w')).

3. Repeat steps 1-2 until convergence, i.e. until we find the w' where J(w') is smallest.

Stopping criteria

We stop when the current cost has not decreased much compared to the previous cost, or when the number of epochs exceeds 6000, as going further would be too much computation.
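The SGD loop and both stopping criteria can be sketched as below. The toy data is an assumption standing in for the preprocessed set; since it is tiny, a larger learning rate and far fewer epochs than the reported C = 10, α = 0.00001, 6000-epoch setup are used here.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy, nearly separable two-class data.
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

C, alpha, max_epochs = 10.0, 1e-3, 200
w, b = np.zeros(2), 0.0

def cost(w, b):
    # J(w, b) = (1/2)||w||^2 + C * sum of hinge losses
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * hinge.sum()

prev = cost(w, b)
for epoch in range(max_epochs):
    for i in rng.permutation(len(X)):      # one example per update (SGD)
        if y[i] * (X[i] @ w + b) >= 1:     # outside margin: regularizer only
            w -= alpha * w
        else:                              # inside margin: add hinge gradient
            w -= alpha * (w - C * y[i] * X[i])
            b -= alpha * (-C * y[i])
    cur = cost(w, b)
    if abs(prev - cur) < 1e-6 * prev:      # cost stopped decreasing: converged
        break
    prev = cur

accuracy = np.mean(np.sign(X @ w + b) == y)
print(accuracy)
```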

Preprocessing data

As seen through the cross-validation above,

the best C is 10 and the best learning rate α is 0.00001.

Decision tree

Data Analysis

No stratification is needed, as the test set has a ratio of zeroes and ones comparable to the training sample.

The training data is split further into training and validation parts so that hyperparameters can be checked on the validation part.

Feature scaling is applied so that a feature with a much larger range does not dominate the others.

From the plot, maximum depth = 8 would produce the lowest mean error, but it might cause overfitting, so we choose max_depth = 7.

From the plot, minimum samples split could be 6. This value will be checked using grid search to prevent overfitting.

Train model and Predict

Train a Gini-based model and predict using the best hyperparameters.
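Checking the plot-suggested values with a grid search might look like the sketch below, where the toy data is an assumption and the grid is centered on the values read off the plots (max_depth near 7, min_samples_split near 6):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real dataset.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small grid around the plot-suggested hyperparameter values.
search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=0),
    {"max_depth": [6, 7, 8], "min_samples_split": [2, 4, 6]},
    cv=5,
).fit(X_train, y_train)

accuracy = (search.predict(X_test) == y_test).mean()
print(search.best_params_, accuracy)
```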

Displaying the actual labels against the predicted

Compare actual with predicted results

As shown above, most of the prediction results are 0 (negative) and are predicted correctly; that is why the true negatives have the greatest count.

Precision = TP / (TP + FP)

Precision represents the ratio of correctly predicted positive samples out of all samples predicted positive. It is 0.99 for the 0 class, as most of the predicted values were 0 and most of them were actually right. The 1 class, however, was predicted only 27 times, of which only 21 were right.

Recall = TP / (TP + FN)

Recall represents the ratio of correctly predicted positive samples out of all actual positive samples. It is high for the 0 class, as almost all actual 0 values were predicted correctly, with very few false negatives (around 2).
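Both metrics fall straight out of the confusion matrix. A minimal sketch with hypothetical labels (an assumption mimicking the imbalance described above: class 0 dominant, class 1 rarer and harder):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 90 negatives, 10 positives; two mistakes each way.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 88 + [1] * 2 + [1] * 8 + [0] * 2)

# sklearn's binary confusion matrix flattens as (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
print(tn, fp, fn, tp, precision, recall)  # 88 2 2 8 0.8 0.8
```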

Visualize Decision tree

The best starting split is on the age feature, as it gives the lowest Gini impurity, as calculated above.

Draw Decision Tree using entropy

From the plot there are many maximum depths with similar error, so we will fine-tune the hyperparameters using grid search.

The Gini-indexed model produces higher accuracy than the entropy one, as the data sample is numerical, which suits Gini better.